{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Scalable GP Regression (CUDA) with Fast Predictive Distributions (KISS-GP/LOVE) \n",
    "\n",
    "\n",
    "## Introduction\n",
    "\n",
    "In this notebook, we'll give a brief tutorial on how to use Lanczos Variance Estimates (LOVE) to achieve fast predictive distributions as described in this paper https://arxiv.org/abs/1803.06058. To see LOVE in use with exact GPs, see `fast_variances_exact_(LOVE).ipynb`. For a tutorial on using the fast sampling mechanism described in the paper, see `fast_sampling_ski_(LOVE).ipynb`.\n",
    "\n",
    "LOVE is an algorithm for approximating the predictive covariances of a Gaussian process in constant time after linear time precomputation. In this notebook, we will train a deep kernel learning model with SKI on one of the UCI datasets used in the paper, and then compare the time required to make predictions with each model.\n",
    "\n",
    "**NOTE**: The timing results reported in the paper compare the time required to compute (co)variances __only__. Because excluding the mean computations from the timing results requires hacking the internals of GPyTorch, the timing results presented in this notebook include the time required to compute predictive means, which are not accelerated by LOVE. Nevertheless, as we will see, LOVE achieves impressive speed-ups."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "import torch\n",
    "import gpytorch\n",
    "from matplotlib import pyplot as plt\n",
    "\n",
    "# Make plots inline\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading Data\n",
    "\n",
    "For this example notebook, we'll be using the `elevators` UCI dataset used in the paper. Running the next cell downloads a copy of the dataset that has already been scaled and normalized appropriately. For this notebook, we'll simply be splitting the data using the first 80% of the data as training and the last 20% as testing.\n",
    "\n",
    "**Note**: Running the next cell will attempt to download a ~400 KB dataset file to the current directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import urllib.request\n",
    "import os.path\n",
    "from scipy.io import loadmat\n",
    "from math import floor\n",
    "\n",
    "if not os.path.isfile('elevators.mat'):\n",
    "    print('Downloading \\'elevators\\' UCI dataset...')\n",
    "    urllib.request.urlretrieve('https://drive.google.com/uc?export=download&id=1jhWL3YUHvXIaftia4qeAyDwVxo6j1alk', 'elevators.mat')\n",
    "    \n",
    "data = torch.Tensor(loadmat('elevators.mat')['data'])\n",
    "X = data[:, :-1]\n",
    "X = X - X.min(0)[0]\n",
    "X = 2 * (X / X.max(0)[0]) - 1\n",
    "y = data[:, -1]\n",
    "\n",
    "# Use the first 80% of the data for training, and the last 20% for testing.\n",
    "train_n = int(floor(0.8*len(X)))\n",
    "\n",
    "train_x = X[:train_n, :].contiguous().cuda()\n",
    "train_y = y[:train_n].contiguous().cuda()\n",
    "\n",
    "test_x = X[train_n:, :].contiguous().cuda()\n",
    "test_y = y[train_n:].contiguous().cuda()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining the DKL Feature Extractor\n",
    "\n",
    "Next, we define the deep feature extractor we'll be using for DKL. In this case, we use a fully connected network with the architecture `d -> 1000 -> 500 -> 50 -> 2`, as described in the original DKL paper. All of the code below uses standard PyTorch implementations of neural network layers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_dim = train_x.size(-1)\n",
    "\n",
    "class LargeFeatureExtractor(torch.nn.Sequential):           \n",
    "    def __init__(self):                                      \n",
    "        super(LargeFeatureExtractor, self).__init__()        \n",
    "        self.add_module('linear1', torch.nn.Linear(data_dim, 1000))\n",
    "        self.add_module('relu1', torch.nn.ReLU())                  \n",
    "        self.add_module('linear2', torch.nn.Linear(1000, 500))     \n",
    "        self.add_module('relu2', torch.nn.ReLU())                  \n",
    "        self.add_module('linear3', torch.nn.Linear(500, 50))       \n",
    "        self.add_module('relu3', torch.nn.ReLU())                  \n",
    "        self.add_module('linear4', torch.nn.Linear(50, 2))         \n",
    "                                                             \n",
    "feature_extractor = LargeFeatureExtractor().cuda()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining the GP Model\n",
    "\n",
    "We now define the GP model. For more details on the use of GP models, see our simpler examples. This model uses a `GridInterpolationKernel` (SKI) with an RBF base kernel. The forward method passes the input data `x` through the neural network feature extractor defined above, scales the resulting features to be between 0 and 1, and then calls the kernel."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "class GPRegressionModel(gpytorch.models.ExactGP):\n",
    "    def __init__(self, train_x, train_y, likelihood):\n",
    "        super(GPRegressionModel, self).__init__(train_x, train_y, likelihood)\n",
    "        \n",
    "        self.mean_module = gpytorch.means.ConstantMean()\n",
    "        self.covar_module = gpytorch.kernels.GridInterpolationKernel(\n",
    "            gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel()),\n",
    "            grid_size=100, num_dims=2,\n",
    "        )\n",
    "        \n",
    "        # Also add the deep net\n",
    "        self.feature_extractor = feature_extractor\n",
    "\n",
    "    def forward(self, x):\n",
    "        # We're first putting our data through a deep net (feature extractor)\n",
    "        # We're also scaling the features so that they're nice values\n",
    "        projected_x = self.feature_extractor(x)\n",
    "        projected_x = projected_x - projected_x.min(0)[0]\n",
    "        projected_x = 2 * (projected_x / projected_x.max(0)[0]) - 1\n",
    "        \n",
    "        # The rest of this looks like what we've seen\n",
    "        mean_x = self.mean_module(projected_x)\n",
    "        covar_x = self.covar_module(projected_x)\n",
    "        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)\n",
    "\n",
    "    \n",
    "likelihood = gpytorch.likelihoods.GaussianLikelihood()\n",
    "model = GPRegressionModel(train_x, train_y, likelihood).cuda()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training the model\n",
    "\n",
    "The cell below trains the DKL model above, finding optimal hyperparameters using Type-II MLE. We run 20 iterations of training using the `Adam` optimizer built in to PyTorch. With a decent GPU, this should only take a few seconds.\n",
    "\n",
    "It's good to add some L2 regularization to the feature extractor part of the model, but NOT to any other part of the model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/jrg365/gpytorch/gpytorch/utils/cholesky.py:14: UserWarning: torch.potrf is deprecated in favour of torch.cholesky and will be removed in the next release. Please use torch.cholesky instead and note that the :attr:`upper` argument in torch.cholesky defaults to ``False``.\n",
      "  potrf_list = [sub_mat.potrf() for sub_mat in mat.view(-1, *mat.shape[-2:])]\n",
      "/home/jrg365/gpytorch/gpytorch/lazy/added_diag_lazy_tensor.py:66: UserWarning: torch.potrf is deprecated in favour of torch.cholesky and will be removed in the next release. Please use torch.cholesky instead and note that the :attr:`upper` argument in torch.cholesky defaults to ``False``.\n",
      "  ld_one = lr_flipped.potrf().diag().log().sum() * 2\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Iter 1/40 - Loss: 0.950\n",
      "Iter 2/40 - Loss: 0.936\n",
      "Iter 3/40 - Loss: 0.928\n",
      "Iter 4/40 - Loss: 0.921\n",
      "Iter 5/40 - Loss: 0.915\n",
      "Iter 6/40 - Loss: 0.910\n",
      "Iter 7/40 - Loss: 0.905\n",
      "Iter 8/40 - Loss: 0.899\n",
      "Iter 9/40 - Loss: 0.894\n",
      "Iter 10/40 - Loss: 0.889\n",
      "Iter 11/40 - Loss: 0.884\n",
      "Iter 12/40 - Loss: 0.879\n",
      "Iter 13/40 - Loss: 0.874\n",
      "Iter 14/40 - Loss: 0.869\n",
      "Iter 15/40 - Loss: 0.864\n",
      "Iter 16/40 - Loss: 0.859\n",
      "Iter 17/40 - Loss: 0.853\n",
      "Iter 18/40 - Loss: 0.848\n",
      "Iter 19/40 - Loss: 0.843\n",
      "Iter 20/40 - Loss: 0.840\n",
      "Iter 21/40 - Loss: 0.834\n",
      "Iter 22/40 - Loss: 0.829\n",
      "Iter 23/40 - Loss: 0.820\n",
      "Iter 24/40 - Loss: 0.815\n",
      "Iter 25/40 - Loss: 0.810\n",
      "Iter 26/40 - Loss: 0.806\n",
      "Iter 27/40 - Loss: 0.801\n",
      "Iter 28/40 - Loss: 0.796\n",
      "Iter 29/40 - Loss: 0.790\n",
      "Iter 30/40 - Loss: 0.785\n",
      "Iter 31/40 - Loss: 0.780\n",
      "Iter 32/40 - Loss: 0.775\n",
      "Iter 33/40 - Loss: 0.769\n",
      "Iter 34/40 - Loss: 0.764\n",
      "Iter 35/40 - Loss: 0.759\n",
      "Iter 36/40 - Loss: 0.754\n",
      "Iter 37/40 - Loss: 0.754\n",
      "Iter 38/40 - Loss: 0.748\n",
      "Iter 39/40 - Loss: 0.739\n",
      "Iter 40/40 - Loss: 0.735\n",
      "CPU times: user 6.73 s, sys: 2.3 s, total: 9.04 s\n",
      "Wall time: 9.02 s\n"
     ]
    }
   ],
   "source": [
    "# Find optimal model hyperparameters\n",
    "model.train()\n",
    "likelihood.train()\n",
    "\n",
    "# Use the adam optimizer\n",
    "# Add weight decay to the feature exactractor ONLY\n",
    "optimizer = torch.optim.Adam([\n",
    "    {'params': model.mean_module.parameters()},\n",
    "    {'params': model.covar_module.parameters()},\n",
    "    {'params': model.likelihood.parameters()},\n",
    "    {'params': model.feature_extractor.parameters(), 'weight_decay': 1e-3}\n",
    "], lr=0.01)\n",
    "\n",
    "# \"Loss\" for GPs - the marginal log likelihood\n",
    "mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)\n",
    " \n",
    "def train(training_iterations=40):\n",
    "    for i in range(training_iterations):\n",
    "        optimizer.zero_grad()\n",
    "        output = model(train_x)\n",
    "        loss = -mll(output, train_y)\n",
    "        loss.backward()\n",
    "        print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iterations, loss.item()))\n",
    "        optimizer.step()\n",
    "        \n",
    "# Sometimes we get better performance on the GPU when we don't use Toeplitz math\n",
    "# for SKI. This flag controls that\n",
    "with gpytorch.settings.use_toeplitz(False):\n",
    "    %time train()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Make Predictions using Standard SKI Code\n",
    "\n",
    "The next cell gets the predictive covariance for the test set (and also technically gets the predictive mean, stored in `preds.mean()`) using the standard SKI testing code, with no acceleration or precomputation. \n",
    "\n",
    "**Note:** Full predictive covariance matrices (and the computations needed to get them) can be quite memory intensive. Depending on the memory available on your GPU, you may need to reduce the size of the test set for the code below to run. If you run out of memory, try replacing `test_x` below with something like `test_x[:1000]` to use the first 1000 test points only, and then restart the notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/jrg365/gpytorch/gpytorch/utils/cholesky.py:14: UserWarning: torch.potrf is deprecated in favour of torch.cholesky and will be removed in the next release. Please use torch.cholesky instead and note that the :attr:`upper` argument in torch.cholesky defaults to ``False``.\n",
      "  potrf_list = [sub_mat.potrf() for sub_mat in mat.view(-1, *mat.shape[-2:])]\n",
      "/home/jrg365/gpytorch/gpytorch/lazy/added_diag_lazy_tensor.py:66: UserWarning: torch.potrf is deprecated in favour of torch.cholesky and will be removed in the next release. Please use torch.cholesky instead and note that the :attr:`upper` argument in torch.cholesky defaults to ``False``.\n",
      "  ld_one = lr_flipped.potrf().diag().log().sum() * 2\n"
     ]
    }
   ],
   "source": [
    "import time\n",
    "\n",
    "# Set into eval mode\n",
    "model.eval()\n",
    "likelihood.eval()\n",
    "with torch.no_grad(), gpytorch.settings.use_toeplitz(False):\n",
    "    start_time = time.time()\n",
    "    preds = model(test_x[:1000])\n",
    "    exact_covar = preds.covariance_matrix\n",
    "    exact_covar_time = time.time() - start_time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Time to compute exact mean + covariances: 0.60s\n"
     ]
    }
   ],
   "source": [
    "print('Time to compute exact mean + covariances: {:.2f}s'.format(exact_covar_time))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Clear Memory and any Precomputed Values\n",
    "\n",
    "The next cell clears as much as possible to avoid influencing the timing results of the fast predictive variances code. Strictly speaking, the timing results above and the timing results to follow should be run in entirely separate notebooks. However,  this will suffice for this simple example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "GaussianLikelihood()"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Clear as much 'stuff' as possible\n",
    "import gc\n",
    "gc.collect()\n",
    "torch.cuda.empty_cache()\n",
    "model.train()\n",
    "likelihood.train()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compute Predictions with LOVE, but Before Precomputation\n",
    "\n",
    "Next we compute predictive covariances (and the predictive means) for LOVE, but starting from scratch. That is, we don't yet have access to the precomputed cache discussed in the paper. This should still be faster than the full covariance computation code above.\n",
    "\n",
    "In this simple example, we allow a rank 10 root decomposition, although increasing this to rank 20-40 should not affect the timing results substantially."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/jrg365/gpytorch/gpytorch/utils/cholesky.py:14: UserWarning: torch.potrf is deprecated in favour of torch.cholesky and will be removed in the next release. Please use torch.cholesky instead and note that the :attr:`upper` argument in torch.cholesky defaults to ``False``.\n",
      "  potrf_list = [sub_mat.potrf() for sub_mat in mat.view(-1, *mat.shape[-2:])]\n",
      "/home/jrg365/gpytorch/gpytorch/lazy/added_diag_lazy_tensor.py:66: UserWarning: torch.potrf is deprecated in favour of torch.cholesky and will be removed in the next release. Please use torch.cholesky instead and note that the :attr:`upper` argument in torch.cholesky defaults to ``False``.\n",
      "  ld_one = lr_flipped.potrf().diag().log().sum() * 2\n"
     ]
    }
   ],
   "source": [
    "# Set into eval mode\n",
    "model.eval()\n",
    "likelihood.eval()\n",
    "with torch.no_grad(), gpytorch.settings.use_toeplitz(False), gpytorch.settings.fast_pred_var(), gpytorch.settings.max_root_decomposition_size(10):\n",
    "    start_time = time.time()\n",
    "    preds = model(test_x[:1000])\n",
    "    fast_time_no_cache = time.time() - start_time"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compute Predictions with LOVE After Precomputation\n",
    "\n",
    "The above cell additionally computed the caches required to get fast predictions. From this point onwards, unless we put the model back in training mode, predictions should be extremely fast. The cell below re-runs the above code, but takes full advantage of both the mean cache and the LOVE cache for variances."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "with torch.no_grad(), gpytorch.settings.use_toeplitz(False), gpytorch.settings.fast_pred_var(), gpytorch.settings.max_root_decomposition_size(10):\n",
    "    start_time = time.time()\n",
    "    preds = model(test_x[:1000])\n",
    "    fast_covar = preds.covariance_matrix\n",
    "    fast_time_with_cache = time.time() - start_time"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Time to compute mean + covariances (no cache) 0.37s\n",
      "Time to compute mean + variances (cache): 0.04s\n"
     ]
    }
   ],
   "source": [
    "print('Time to compute mean + covariances (no cache) {:.2f}s'.format(fast_time_no_cache))\n",
    "print('Time to compute mean + variances (cache): {:.2f}s'.format(fast_time_with_cache))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Compute Error between Exact and Fast Variances\n",
    "\n",
    "Finally, we compute the mean absolute error between the fast variances computed by LOVE (stored in fast_covar), and the exact variances computed previously. \n",
    "\n",
    "Note that these tests were run with a root decomposition of rank 10, which is about the minimum you would realistically ever run with. Despite this, the fast variance estimates are quite good. If more accuracy was needed, increasing `max_root_decomposition_size` to 30 or 40 would provide even better estimates."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "MAE between exact covar matrix and fast covar matrix: 4.553118742478546e-06\n"
     ]
    }
   ],
   "source": [
    "print('MAE between exact covar matrix and fast covar matrix: {}'.format((exact_covar - fast_covar).abs().mean()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
